# Set global chunk obptions
knitr::opts_chunk$set(fig.width=12, fig.height=12, warning = F)

# Data 
library(readxl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Tables
library(table1)
#> 
#> Attaching package: 'table1'
#> The following objects are masked from 'package:base':
#> 
#>     units, units<-

# Plots
library(ggplot2)
library(ggplotify)

# Load Metabolomics Pipeline
library(MetabolomicsPipeline)

# To upload image
library(magick)
#> Linking to ImageMagick 6.9.12.96
#> Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
#> Disabled features: fontconfig, x11

Introduction

The purpose of the MetabolomicPipeline package is to provide additional tools to complement the analysis done by Metabolon. This package allows us to create a workflow that contains six different chunks.

  1. Peak normalization and standardization

  2. Analysis data creation

  3. Data exploration

  4. Subpathway Analysis

  5. Pairwise comparisons

  6. Box plot and line plots

Data Description

In this vignette, we will use data which consists of 86 samples (42 males, 44 females), three treatment groups, and the samples were taken at three different time points. In the original experiment, the sex of each sample was not captured. Therefore, we have an additional metadata .xlsx file which contains the additional sex variable, as well as a variable called “PARENT_SAMPLE_NAME, which allows us to link the extra metadata to the sample metadata.

Normalization and Standardization

Metabolon includes normalized peak data as part of the Metabolon Excel file. We recommend you use the normalized peak data from Metabolon; therefore, the normalization and standardization steps are optional. However, in some cases, you may need to perform normalization differently than Metabolon provided. An example of this is experiments that include multiple batches. Metabolon will perform the normalization for each batch separately, which can cause downstream issues. When analyzing multiple batches we recommend combining the raw data and performing one normalization on all batches.

In its raw form, the peak data contain counts for each metabolite in each sample. Standardization and normalization are essential steps to improve the signal-to-noise ratio. We implement normalization and standardization by:

  1. Standardizing each metabolite by the metabolites median value. (median_standardization)

  2. Imputing missing values for each metabolite by the minimum value of the metabolite. (min_val_impute)

  3. Log transform each value (log_trans_met)

In the chunk below, we load the raw peak data from the Metabolon Excel sheet and perform the normalization and standardization steps. Each step only requires the output from the previous step.

###############################################################################
##### Load Data ###############################################################
###############################################################################
# Provide a path to Metabolon .xlsx file. 
metabolon_path <- "C:/Users/Owner/Box/Mouse Metabolon Data/Spinal Cord/Data/UNAZ-03-21VW SPINAL CORD DATA TABLES.XLSX" 

#  Load raw peak data
peak_data <- read_excel(metabolon_path, sheet = "Peak Area Data") 


###############################################################################
###### Standardization and Normalization ######################################
###############################################################################

# 1 Median standardization
med_std <- median_standardization(peak_data = peak_data)

# 2 Minimum Value imputation
impute <- min_val_impute(med_std)

# 3 Log Transformation
log_trans_met <- log_transformation(impute)

Analysis Data Creation

We receive multiple datasets from Metabolon, each on a different tab of the Metabolon Excel file. The datasets needed for the downstream analysis are the sample metadata, chemical annotation, and log-transformed peak data. The table below shows the tab that each dataset is on within the Metabolon Excel file.

Data Tab of excel file
Sample metadata Sample Meta Data
Chemical annotation Chemical Annotation
Log-transformed peak data Log Transformed Data

The MetabolomicsPipeline package provides a convenient way to load each of these datasets together using the “loadMetPipeFromMetabolon” function, which only requires you to provide the path to the Metabolon Excel file.

In addition to the data from the Metabolon Excel file, we will also create an analysis dataset, which contains the analysis variables from the sample metadata and the log-transformed peak data.

Create Analysis Data Steps

We take the following steps to create the analysis data.

  1. Load Data using the loadMetPipeFromMetabolon function.
  • We also load and add the additional metadata to the sample metadata.
  1. Enter metadata variables for analysis.

  2. Create the analysis data and add it to the rest of the experiment data.

  3. Create a table showing the sample distribution.

################################################################################
### Load Data ##################################################################
################################################################################

# Load Metabolon data from the Excel file
dat <- loadMetPipeFromMetabolon(metabolon_path = "../data/UNAZ-0501-22VW_ DATA TABLES.xlsx")

# load additional metadata
meta_data_additional <- read_excel("../data/AdditionalVars.xlsx")


# 2. Merge additional vars to the meta data
if(nrow(meta_data_additional)>0){
  
  meta_data <- meta_data_additional %>% 
    left_join(dat@meta,"PARENT_SAMPLE_NAME")
  
  # Update meta data slot
  dat@meta <- meta_data
}


################################################################################
### Enter Metadata Variables For Analysis ######################################
################################################################################
metadata_variables <- c("PARENT_SAMPLE_NAME", 
                        "GROUP_NAME",
                        "TIME1",
                        "Gender") 


################################################################################
### Add Analysis Data To MetPipe Object ########################################
################################################################################
# 2. Create analysis data
dat@analysis <- dat@meta %>% 
  select(all_of(metadata_variables)) %>% 
  left_join(dat@standardized_peak,"PARENT_SAMPLE_NAME") %>%
  as.data.frame()


################################################################################
### Create Table 1 #############################################################
################################################################################
# 5. Create table 1
tbl1 <- table1(~ GROUP_NAME + TIME1| Gender
               , data = dat@analysis) 

# 6. Display table 1
tbl1
Female
(N=44)
Male
(N=42)
Overall
(N=86)
GROUP_NAME
1902 15 (34.1%) 14 (33.3%) 29 (33.7%)
Control 14 (31.8%) 14 (33.3%) 28 (32.6%)
WT 15 (34.1%) 14 (33.3%) 29 (33.7%)
TIME1
End 14 (31.8%) 15 (35.7%) 29 (33.7%)
Onset 15 (34.1%) 14 (33.3%) 29 (33.7%)
PreSymp 15 (34.1%) 13 (31.0%) 28 (32.6%)


# Save data
# saveRDS(dat, file = "data/dat.Rds")

Exploratory Analysis

In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods of data exploration:

A.) Principal component analysis

B.) Heatmaps

Principal Component Analysis (PCA)

In general, Principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component (PC) accounts for the largest amount of variance within the data. The second PC accounts for the largest remaining variance, and so on. Additionally, each of the PCs produced by PCA is uncorrelated with the other principal components. PCA can allow us to visualize sources of variation in the data. The metabolite_pca function will enable us to specify a sample metadata variable to label the points in the plot.

Suppose you notice a variable with clearly separated groups that is not a variable of interest. In that case, consider stratifying your downstream analysis by the values of that variable. For example, we will stratify the downstream analysis by male/female in our vignette data.

Heatmaps

For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log normalized peak values for each metabolite in each observation. We can group the observations by the experimental conditions. Grouping the experimental conditions in a heatmap is another way of visualizing sources of variation within our data.

We can use the metabolite_heatmap function to create the heatmaps, which requires the following arguments.

  • MetPipe: This is the experiment data

  • top_mets: The number of metabolites to include in the heatmap. The metabolites are chosen based on the metabolites with the highest variability.

  • group_vars: The variables to group the samples by. caption: The title of the heatmap.

  • … : You can add additional arguments to order the samples

Exploratory analysis steps

In the chunk below, we create a PCA plot labeled by Gender. Then, we make three heatmaps increasing by complexity.

###############################################################################
### Run PCA ###################################################################
###############################################################################

# Define PCA label from metadata
meta_var = "Gender"

# Run PCA
pca <- metabolite_pca( dat,
               meta_var = meta_var)


# Show heatmap
pca


################################################################################
### Run Heatmaps ###############################################################
################################################################################

# Heatmap with one group
treat_heatmap <- metabolite_heatmap(dat,top_mets = 50,
                   group_vars = "GROUP_NAME",
                   strat_var = NULL,
                   caption = "Heatmap Arranged By Group",
                   GROUP_NAME)


as.ggplot(treat_heatmap)



# Heatmap with two groups
treatandtime <-  metabolite_heatmap(dat,top_mets = 50,
                   group_vars = c("GROUP_NAME","TIME1"),
                   strat_var = NULL,
                   caption = "Heatmap Arranged By Group and TIME",
                   GROUP_NAME, desc(TIME1))


as.ggplot(treatandtime)



# Heatmap with 2 group and stratified
 strat_heat <- metabolite_heatmap(dat,top_mets = 50,
                   group_vars = c("GROUP_NAME","TIME1"),
                   strat_var = "Gender",
                   caption = "Heatmap Arranged By Group and TIME",
                   GROUP_NAME, desc(TIME1))

 
## Female
as.ggplot(strat_heat[[1]])


# Male
as.ggplot(strat_heat[[2]])

Subpathway Analysis

In the chemical annotation file, we will see that each metabolite is within a sub-pathway, and each subpathway is within a superpathway. There are several metabolites within each subpathway and several sub-pathways within each Super-pathway. We can utilize an Analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level, which is already part of the Metabolon analysis. However, since multiple metabolites are within a sub-pathway, it is challenging to test if the treatment affected the peak data at the sub-pathway level. For this, we utilize a combined Fisher probability test. The combined Fisher test combines the p-values from independent tests to test the hypothesis for an overall effect. The Combined Fisher Probability is helpful for testing a model at the subpathway level based on the pvalues from the model at the metabolite level.

Combined Fished Analysis

We will test at the subpathway level by combining the p-values for each metabolite within the subpathway for each model. We use a combination function given by \(\tilde{X}\) which combines the pvalues, resulting in a chi-squared test statistic.

\[ \tilde{X} = -2\sum_{i=1}^k ln(p_i) \] where \(k\) is the number of metabolites in the subpathway. We can get a p-value from \(P(X \geq\tilde{X})\), knowing that \(\tilde{X}\sim \chi^2_{2k}\). You will notice that smaller p-values will lead to a larger \(\tilde{X}\).

Assumptions

Since we are first testing each metabolite utilizing ANOVA, we make the following assumptions for each metabolite,

  • Independence: Each observation is independent of all other observations. Therefore, if you have collected multiple samples from the same subject then this assumption may be violated.

  • Normality: The metabolite log-scaled intensities follow a normal distribution within each of the treatment groups.

  • Homoscedasticity: Equal variance between treatment groups.

In addition to the assumptions in the ANOVA models at the metabolite level, the Fisher’s Combined probability places an independence assumption between the metabolites within the subpathway.

For more about the Combined Fisher Probability and other methods that can address this problem, see:

Loughin, Thomas M. “A systematic comparison of methods for combining p-values from independent tests.” Computational statistics & data analysis 47.3 (2004): 467-485.

Models

To test our hypothesis at the subpathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.

1.) Interaction: \(log Peak = Treatment Group + Time + Treatment*Time\)

2.) Parallel: \(log Peak = Treatment Group + Time\)

3.) Single: \(log Peak = Treatment\)

For the interaction model, we are focusing only on the interaction term “Treatment*Time” to test if there is a significant interaction between our treatment and the time variable. The parallel model is testing if the time variable is explaining a significant amount of the metabolite variance, and the treatment model is testing if the treatment explains a significant proportion of the variance for each metabolite.

We test at the subpathway level using the Combined Fisher Probability method to combine the p-values from each model for all metabolites within the subpathway. To run the subpathway analysis, we use the “subpathway_analysis” function, which requires the following arguments.

  • MetPipe: The experiment data.

  • treat_var: The treatment variable of interest.

  • block_var: The block variable, in our example the block variable is going to be time.

  • strat_var: The name of the variable we want to stratify our analysis by. In our this is going to be “Gender”.

Results Summaries

With the MetabolomicsPipeline package, we provide three different ways to summarize the results from the subpathway analysis.

  1. Number of significant subpathways by model type (subpath_by_model)

  2. Percentage of significant subpathways within superpathways (subpath_within_superpath)

  3. Metabolite model results within a specified subpathway (met_within_sub)

################################################################################
## Stratified Analysis #########################################################
################################################################################

# Stratified Analysis
stratified = subpathway_analysis(dat,
                                     treat_var = "GROUP_NAME",
                                     block_var = "TIME1",
                                     strat_var = "Gender")
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
#> [1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
#> [1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
#> [1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"


################################################################################
### Results Plots ##############################################################
################################################################################

# 1. significant subpathways by model type
subpath_by_model(stratified)
Sigificant Pathways by Model
Model Type Female Male
Interaction 6 84
Parallel 20 12
Single 19 2
None 65 12

# 2. Percentage of signficant subpathways within superpathways
subpath_within_superpath(stratified)
Proportion of significant subpathways within super-pathways
Super Pathway Percent Significant (Female) Percent Significant (Male)
Xenobiotics 5 / 5 (100%) 5 / 5 (100%)
Amino Acid 13 / 16 (81.25%) 16 / 16 (100%)
Peptide 3 / 5 (60%) 3 / 5 (60%)
Nucleotide 3 / 8 (37.5%) 8 / 8 (100%)
Lipid 16 / 53 (30.19%) 46 / 53 (86.79%)
Carbohydrate 2 / 8 (25%) 7 / 8 (87.5%)
Cofactors and Vitamins 2 / 11 (18.18%) 9 / 11 (81.82%)
Energy 0 / 2 (0%) 2 / 2 (100%)
Partially Characterized Molecules 0 / 1 (0%) 1 / 1 (100%)

# 3. Metabolites within subpathway
tables <- met_within_sub(stratified, subpathway = "Partially Characterized Molecules")

### Females
tables[[1]]
Metabolites within Partially Characterized Molecules (Female)
Metabolite Name Interaction P-Value Parallel P-Value Single P-Value
glycolate (hydroxyacetate) 0.437 0.435 0.408
iminodiacetate (IDA) 0.936 0.373 0.017
EDTA 0.709 0.269 0.098
ectoine 0.833 0.742 0.572
sulfate* 0.596 0.880 0.858
ethyl glucuronide 0.293 0.891 0.329
dimethyl sulfone 0.703 0.350 0.908
3-acetylphenol sulfate 0.533 0.033 0.018
benzoylcarnitine* 0.594 0.271 0.067
S-(3-hydroxypropyl)mercapturic acid (HPMA) 0.556 0.346 0.055
O-sulfo-tyrosine 0.558 0.632 0.086
3-hydroxypyridine sulfate 0.470 0.163 0.001
6-hydroxyindole sulfate 0.266 0.292 0.051
1,2,3-benzenetriol sulfate (2) 0.208 0.097 0.000
thioproline 0.493 0.113 0.387
4-acetamidobenzoate 0.069 0.026 0.084
perfluorooctanesulfonate (PFOS) 0.397 0.010 0.199
methylnaphthyl sulfate (1)* 0.443 0.670 0.065
methylnaphthyl sulfate (2)* 0.648 0.633 0.002
2,2’-methylenebis(6-tert-butyl-p-cresol) 0.011 0.000 0.100
3-hydroxy-2-methylpyridine sulfate 0.392 0.083 0.001
3,5-dichloro-2,6-dihydroxybenzoic acid 0.442 0.002 0.253
3-bromo-5-chloro-2,6-dihydroxybenzoic acid* 0.613 0.008 0.208
4-chlorobenzoic acid 0.364 0.538 0.664
perfluorohexanesulfonate (PFHxS) 0.529 0.123 0.230

### Males
tables[[2]]
Metabolites within Partially Characterized Molecules (Male)
Metabolite Name Interaction P-Value Parallel P-Value Single P-Value
glycolate (hydroxyacetate) 0.054 0.069 0.181
iminodiacetate (IDA) 0.052 0.954 0.001
EDTA 0.116 0.987 0.005
ectoine 0.492 0.019 0.284
sulfate* 0.315 0.247 0.891
ethyl glucuronide 0.032 0.114 0.035
dimethyl sulfone 0.002 0.021 0.236
3-acetylphenol sulfate 0.160 0.566 0.837
benzoylcarnitine* 0.378 0.272 0.055
S-(3-hydroxypropyl)mercapturic acid (HPMA) 0.033 0.027 0.113
O-sulfo-tyrosine 0.162 0.020 0.618
3-hydroxypyridine sulfate 0.095 0.100 0.161
6-hydroxyindole sulfate 0.064 0.069 0.498
1,2,3-benzenetriol sulfate (2) 0.293 0.361 0.458
thioproline 0.004 0.931 0.371
4-acetamidobenzoate 0.801 0.190 0.253
perfluorooctanesulfonate (PFOS) 0.058 0.122 0.387
methylnaphthyl sulfate (1)* 0.042 0.064 0.055
methylnaphthyl sulfate (2)* 0.033 0.013 0.067
2,2’-methylenebis(6-tert-butyl-p-cresol) 0.006 0.000 0.419
3-hydroxy-2-methylpyridine sulfate 0.258 0.095 0.369
3,5-dichloro-2,6-dihydroxybenzoic acid 0.043 0.002 0.251
3-bromo-5-chloro-2,6-dihydroxybenzoic acid* 0.044 0.020 0.135
4-chlorobenzoic acid 0.933 0.343 0.617
perfluorohexanesulfonate (PFHxS) 0.026 0.486 0.049

Pairwise Analysis

We can look at the pairwise comparisons for all experimental groups at the metabolite level. Metabolon includes this as part of their analysis. However, if you need to change the model, you must rerun the pairwise analysis. We will use the metabolite_pairwise function within the MetabolomicsPipeline package, which requires the following arguments:

Log Fold-Change Heatmap

We will produce a heatmap of the log fold changes for the metabolites with a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis). The heatmap colors will only show if the log fold-change is greater than log(2) or less than log(.5). Therefore, this heatmap will only focus on comparisons with a fold change of two or greater. The met_est_heatmap function will produce an interactive heatmap using the results from the pairwise analysis.

P-Value Heatmap

Similar to the pairwise estimate heatmap, we will produce a heatmap where the heatmap will only include metabolites with a significant overall p-value, and the values in the heat map will only be colored if the pairwise comparison is significant. We use the
met_p_heatmap function to create an interactive p-value heatmap.

################################################################################
#### Run Pairwise Comparisons ##################################################
################################################################################

strat_pairwise = metabolite_pairwise(dat,form = "GROUP_NAME*TIME1",strat_var = "Gender")


###############################################################################
## Create Estimate Heatmap #####################################################
################################################################################

met_est_heatmap(strat_pairwise$Female, dat)


################################################################################
## Create P-value Heatmap ######################################################
################################################################################

# Female
met_p_heatmap(strat_pairwise$Female, dat)

Boxplots and Lineplots

Visualizations of the data can help us see the underlying trends. Two useful visualizations are boxplots and line plots, we will be using the subpathway_boxplots and subpathway_lineplots functions to create them. The main utility of these functions is it allows you for focus on the metabolites within a subpathway. For both functions, the arguments are:

Boxplots and Lineplots steps

################################################################################
### BoxPlots ###################################################################
################################################################################

subpathway_boxplots(dat, subpathway = "Lactoyl Amino Acid", X=TIME1,
                    groupBy = GROUP_NAME, Gender =="Female")



################################################################################
## Line plots ##################################################################
################################################################################

# Set up data
dat@analysis$TIME1 <- as.numeric(factor(dat@analysis$TIME1,
                                         levels = c("PreSymp","Onset","End")))

# Create line plots 
subpathway_lineplots(dat, subpathway = "Lactoyl Amino Acid",
                     X= TIME1,groupBy = GROUP_NAME, Gender=="Female" )
#> `geom_smooth()` using formula = 'y ~ x'